In this project I analyse data on the best movies from the website Metacritic (https://www.metacritic.com/browse/movies/score/metascore/all/filtered?sort=desc). The goal is to scrape several characteristics of each movie: year of release, director, distributor, runtime, country, and metascore. I then run a simple EDA to understand the data better and, finally, study which factors influence the runtime of the movies.
library(rvest)
library(dplyr)
url <- "https://www.metacritic.com/browse/movies/genre/metascore/thriller?view=detailed"
page <- read_html(url)
title <- page %>%
html_nodes(".title h3") %>%
html_text()
metascore <- page %>%
html_nodes(".clamp-score-wrap .positive") %>%
html_text()
date <- page %>%
html_nodes(".clamp-details span:nth-child(1)")%>%
html_text()
Next I need data that is available only on each movie's own page (the nested links).
First, I collect the nested links for all movies with the following code:
movie_links <- page %>%
html_nodes("a.title") %>%
html_attr("href") %>%
paste("https://www.metacritic.com", ., sep="")
Second, I need a function that extracts the distributor from a movie page:
get_distributor <- function(link) {
  # Download the movie page and read the distributor name
  movie_page <- read_html(link)
  movie_distributor <- movie_page %>%
    html_nodes(".distributor a") %>%
    html_text()
  return(movie_distributor)
}
Now I need to apply this function to all nested links:
distributor <- sapply(movie_links, FUN = get_distributor)
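A single failed request stops `sapply()` and discards everything scraped so far. The following is a minimal sketch of a defensive wrapper, not part of the original code: the helper name `scrape_safely`, the `delay` argument and the stand-in `fake_scraper` are all assumptions made for illustration, and the example uses placeholder links so that it runs without network access.

```r
# Hypothetical helper (not in the original code): call `scraper` on one link,
# optionally pause first, and return NA instead of stopping on an error.
scrape_safely <- function(link, scraper, delay = 0) {
  Sys.sleep(delay)  # e.g. delay = 1 to space out real requests
  tryCatch(
    scraper(link),
    error = function(e) NA_character_
  )
}

# Stand-in for get_distributor(): fails for one link to mimic a broken page.
fake_scraper <- function(link) {
  if (grepl("broken", link)) stop("HTTP error")
  "Paramount Pictures"
}

links <- c("https://example.com/movie/ok", "https://example.com/movie/broken")

# vapply() guarantees a character vector with one value per link
result <- vapply(links, scrape_safely, character(1), scraper = fake_scraper)
print(unname(result))
```

With the real functions the call might look like `vapply(movie_links, scrape_safely, character(1), scraper = get_distributor, delay = 1)`, provided the scraper returns exactly one string per page.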
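The same procedure extracts the directors of each movie; several directors are joined into one comma-separated string:
get_director <- function(link) {
  # Download the movie page and collapse all director names into one string
  movie_page <- read_html(link)
  movie_director <- movie_page %>%
    html_nodes(".director a span") %>%
    html_text() %>%
    paste(collapse = ",")
  return(movie_director)
}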
Now I need to apply this function to all nested links:
director <- sapply(movie_links, FUN = get_director)
However, even the nested pages do not show all the information: the rest sits behind the button “See All Details and Credits”. Clicking it appends “/details” to the nested link.
For example: https://www.metacritic.com/movie/the-godfather -> https://www.metacritic.com/movie/the-godfather/details
The same suffix is used for every film. Thus, I can append “/details” to all the links scraped earlier and extract new variables from those pages:
get_country <- function(link) {
details_link <- paste(link, "/details", sep="")
details_page <- read_html(details_link)
country <- details_page %>%
html_nodes(".countries span") %>%
html_text() %>% paste(collapse = ",")
return(country)
}
country <- sapply(movie_links, FUN=get_country)
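Because `paste()` and `paste0()` are vectorised, the details links can also be built for all movies in one step. A small sketch, assuming a hypothetical vector `movie_links_example`: the first link is the one quoted above, while the second, with a trailing slash, is made up to show why stripping slashes first can be useful.

```r
# First link is quoted in the text; the second one is a made-up example
movie_links_example <- c(
  "https://www.metacritic.com/movie/the-godfather",
  "https://www.metacritic.com/movie/rear-window/"
)

# Remove any trailing slash, then append the "/details" suffix to every link
details_links <- paste0(sub("/+$", "", movie_links_example), "/details")
print(details_links)
```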
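The runtime is extracted from each movie page in the same way as the distributor:
get_runtime <- function(link) {
  # Download the movie page and read the runtime label, e.g. "175 min"
  movie_page <- read_html(link)
  movie_runtime <- movie_page %>%
    html_nodes(".runtime .label+ span") %>%
    html_text()
  return(movie_runtime)
}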
Now I need to apply this function to all nested links:
runtime <- sapply(movie_links, FUN = get_runtime)
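`sapply()` simplifies its result to a character vector only when every call returns exactly one value. If a page has no match for the selector, `html_text()` returns an empty vector and the whole result stays a list. Below is a sketch of one way to guard against that; the helper `first_or_na` is hypothetical and the input list is hand-made to stand in for scraped results.

```r
# Hypothetical helper: keep the first match, or NA when nothing was found
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

# Hand-made stand-in for scraped results; movie "b" has no runtime on its page
scraped <- list(a = "175 min", b = character(0), c = "112 min")

# sapply() cannot simplify results of unequal length, so this stays a list
raw <- sapply(scraped, identity)
print(class(raw))

# vapply() with the helper always yields a character vector
runtime_vec <- vapply(scraped, first_or_na, character(1))
print(runtime_vec)
```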
Create a data frame with all scraped variables (`tibble()` replaces the deprecated `data_frame()`):
movies_df <- tibble(title, date, metascore, distributor, director, country, runtime)
glimpse(movies_df)
## Rows: 100
## Columns: 7
## $ title <chr> "The Godfather", "Rear Window", "Vertigo", "Notorious", "T…
## $ date <chr> "March 24, 1972", "September 1, 1954", "May 28, 1958", "Se…
## $ metascore <chr> "100", "100", "100", "100", "99", "98", "98", "98", "98", …
## $ distributor <chr> "Paramount Pictures", "Paramount Pictures", "Paramount Pic…
## $ director <chr> "Francis Ford Coppola", "Alfred Hitchcock", "Alfred Hitchc…
## $ country <chr> "USA", "US", "US", "US", "US", "GB", "USA,Spain,Mexico", "…
## $ runtime <named list> "175 min", "112 min", "128 min", "101 min", "95 min…
Everything looks good, except for date and runtime. For the analysis I need only the year, not the full date, and the runtime as a number without the "min" suffix.
Let’s extract the year from date and save it in a new column:
movies_df$year <- sub(".*,", "", movies_df$date)
Next, let’s strip the "min" suffix from runtime and save the result in a new column:
movies_df$time <- sub("min", "", movies_df$runtime)
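To see what the two `sub()` calls do, here is a small self-contained sketch on values copied from the `glimpse()` output above. The patterns are slightly stricter variants of the ones in the text (they also remove the surrounding whitespace), and the variable names are chosen only for this illustration.

```r
# Sample values copied from the glimpse() output
date_sample <- c("March 24, 1972", "September 1, 1954", "May 28, 1958")
runtime_sample <- c("175 min", "112 min", "128 min")

# Drop everything up to the last comma and the whitespace that follows it
year_sample <- as.integer(sub(".*,\\s*", "", date_sample))

# Drop the trailing "min" together with the whitespace before it
minutes_sample <- as.integer(sub("\\s*min$", "", runtime_sample))

print(year_sample)
print(minutes_sample)
```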
However, there is also a small problem with the country variable: some movies list several countries. I keep only the first country listed.
movies_df$country_1 <- sub(",.*", "", movies_df$country)
The director variable has the same problem, so I keep only the first director listed.
movies_df$director_1 <- sub(",.*", "", movies_df$director)
movies_df <- movies_df %>%
select(- runtime, - date, - director, - country)
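Keeping only the first entry discards information, so it can help to check how many movies are affected. A short sketch on country values taken from the `glimpse()` output; the variable names are assumptions for this example only.

```r
# Country strings as they appear in the scraped data
country_sample <- c("USA", "GB", "USA,Spain,Mexico")

# Same pattern as in the text: drop everything from the first comma onwards
first_country <- sub(",.*", "", country_sample)

# Number of countries listed per movie
n_countries <- lengths(strsplit(country_sample, ",", fixed = TRUE))

print(first_country)
print(n_countries)
```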
There is another problem with the country variable: the same country is written in different ways (e.g., US and USA), so I recode the variants to a single label.
movies_df <- movies_df %>%
mutate(country_1 = recode(country_1,
"US" = "USA",
"GB" = "UK",
"DE" = "Germany",
"JP" = "Japan",
"Hong Kong" = "China"))
movies_df$metascore <- as.numeric(movies_df$metascore)
movies_df$year <- as.numeric(movies_df$year)
movies_df$time <- as.numeric(movies_df$time)
movies_df$country_1 <- as.factor(movies_df$country_1)
movies_df$distributor <- as.factor(movies_df$distributor)
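The `recode()` call above handles five labels, but the frequency table for Country in this report still lists both `FR` and `France`. A sketch of a lookup-table approach in base R that makes such mappings easy to extend; the `FR` entry is an addition that is not in the original code, and the sample vector is made up from labels seen in the data.

```r
# Named vector as a lookup table: names are raw labels, values are clean labels.
# The FR entry is an assumption added for illustration.
country_lookup <- c(US = "USA", GB = "UK", DE = "Germany", JP = "Japan", FR = "France")

codes <- c("US", "USA", "FR", "France", "KR")

# Indexing by name returns NA for labels without an entry in the lookup table
matched <- country_lookup[codes]

# Keep the original label wherever no mapping exists
harmonised <- unname(ifelse(is.na(matched), codes, matched))
print(harmonised)
```

Labels without an entry, such as `KR` here, pass through unchanged, so the result can be tabulated again to spot remaining duplicates.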
Now the data is clean and ready for analysis.
glimpse(movies_df)
## Rows: 100
## Columns: 7
## $ title <chr> "The Godfather", "Rear Window", "Vertigo", "Notorious", "T…
## $ metascore <dbl> 100, 100, 100, 100, 99, 98, 98, 98, 98, 97, 97, 97, 97, 97…
## $ distributor <fct> "Paramount Pictures", "Paramount Pictures", "Paramount Pic…
## $ year <dbl> 1972, 1954, 1958, 1946, 1958, 1938, 2006, 1959, 2002, 1955…
## $ time <dbl> 175, 112, 128, 101, 95, 96, 118, 136, 153, 92, 104, 95, 12…
## $ country_1 <fct> USA, USA, USA, USA, USA, UK, USA, USA, Germany, USA, UK, U…
## $ director_1 <chr> "Francis Ford Coppola", "Alfred Hitchcock", "Alfred Hitchc…
var_names <- movies_df %>%
rename(`Year of release` = year,
`Runtime of movie` = time,
`Metascore` = metascore,
`Country` = country_1,
`Distributor` = distributor)
var_names <- var_names %>%
select(- title, - director_1)
caption_1 <- "Table 1. Sample descriptive statistics for continuous variables"
library(modelsummary)
datasummary_skim(var_names, title = caption_1)
| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| Metascore | 15 | 0 | 91.4 | 4.0 | 86.0 | 90.5 | 100.0 |
| Year of release | 61 | 0 | 1985.4 | 27.1 | 1926.0 | 1988.0 | 2023.0 |
| Runtime of movie | 65 | 2 | 119.8 | 33.2 | 72.0 | 113.0 | 325.0 |
caption_2 <- "Table 2. Sample descriptive statistics for categorical variables"
datasummary_skim(var_names, type = "categorical", title = caption_2)
| | | N | % |
|---|---|---|---|
| Distributor | A24 | 3 | 3.0 |
| | ARRAY Releasing | 1 | 1.0 |
| | British Lion Film Corporation | 2 | 2.0 |
| | Bryanston Distributing | 1 | 1.0 |
| | Cinelicious Pics | 1 | 1.0 |
| | Cineriz | 1 | 1.0 |
| | CJ Entertainment | 1 | 1.0 |
| | Columbia Pictures | 6 | 6.0 |
| | Compass International Pictures | 1 | 1.0 |
| | Filmways Pictures | 1 | 1.0 |
| | Fine Line Features | 1 | 1.0 |
| | Fox Searchlight Pictures | 1 | 1.0 |
| | Gaumont British Distributors | 1 | 1.0 |
| | Geffen Company, The | 1 | 1.0 |
| | Goskino | 1 | 1.0 |
| | Gramercy Pictures (I) | 1 | 1.0 |
| | Grasshopper Film | 1 | 1.0 |
| | Home Box Office (HBO) | 1 | 1.0 |
| | IFC Films | 1 | 1.0 |
| | Janus Film | 1 | 1.0 |
| | Lopert Pictures Corporation | 1 | 1.0 |
| | Metro-Goldwyn-Mayer (MGM) | 5 | 5.0 |
| | Miramax | 1 | 1.0 |
| | Miramax Films | 3 | 3.0 |
| | Motion Picture Export Association (MPEA) | 1 | 1.0 |
| | Neon | 1 | 1.0 |
| | Netflix | 1 | 1.0 |
| | Newmarket Films | 1 | 1.0 |
| | Open Road Films (II) | 1 | 1.0 |
| | Paramount Pictures | 11 | 11.0 |
| | Paramount Vantage | 1 | 1.0 |
| | Picturehouse | 1 | 1.0 |
| | Pierre Grise Distribution | 1 | 1.0 |
| | Rialto Pictures | 2 | 2.0 |
| | RKO Radio Pictures | 1 | 1.0 |
| | Roadside Attractions | 1 | 1.0 |
| | Royal Films International | 1 | 1.0 |
| | Samuel Goldwyn Films | 1 | 1.0 |
| | Selznick Releasing Organization | 1 | 1.0 |
| | Sony Pictures Classics | 3 | 3.0 |
| | Summit Entertainment | 1 | 1.0 |
| | The Cinema Guild | 2 | 2.0 |
| | Times Film Corporation | 1 | 1.0 |
| | Toho Company | 2 | 2.0 |
| | Turtle Releasing | 1 | 1.0 |
| | Twentieth Century Fox Film Corporation | 2 | 2.0 |
| | United Artists | 9 | 9.0 |
| | Universal Pictures | 6 | 6.0 |
| | Warner Bros. | 6 | 6.0 |
| | Warner Bros. Pictures | 3 | 3.0 |
| Country | AU | 1 | 1.0 |
| | FR | 1 | 1.0 |
| | France | 4 | 4.0 |
| | Germany | 9 | 9.0 |
| | India | 1 | 1.0 |
| | IR | 1 | 1.0 |
| | IT | 1 | 1.0 |
| | Japan | 3 | 3.0 |
| | KR | 1 | 1.0 |
| | Spain | 1 | 1.0 |
| | SUHH | 1 | 1.0 |
| | UK | 11 | 11.0 |
| | USA | 65 | 65.0 |
library(plotly)
plot_ly(movies_df, x =~time, y=~metascore, type = 'scatter', mode = 'markers') %>%
layout(title = 'Correlation between time and metascore',
xaxis = list(title = 'Runtime'),
yaxis = list(title = 'Metascore'))
plot_ly(movies_df, x =~year, y=~metascore, type = 'scatter', mode = 'markers') %>%
layout(title = 'Correlation between year of release and metascore',
xaxis = list(title = 'Year'),
yaxis = list(title = 'Metascore'))
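The scatter plots show the relationships visually; a correlation coefficient summarises each one in a single number. A minimal sketch using only the first twelve runtime and metascore values visible in the `glimpse()` output, so the result illustrates the call and is not the correlation for the full sample. On the full data frame, `use = "complete.obs"` would drop the rows with a missing runtime.

```r
# First twelve values of metascore and time from the glimpse() output
metascore_sample <- c(100, 100, 100, 100, 99, 98, 98, 98, 98, 97, 97, 97)
time_sample <- c(175, 112, 128, 101, 95, 96, 118, 136, 153, 92, 104, 95)

# Pearson correlation; complete.obs ignores pairs that contain an NA
r <- cor(time_sample, metascore_sample, use = "complete.obs")
print(round(r, 2))
```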
What factors influence the runtime of the movies?
To use the categorical variable country in a regression, I need to reduce its number of categories. I decided to create a binary variable that indicates whether the film's country is the USA or not.
movies_df_2 <- movies_df %>%
mutate(country_binary = ifelse(country_1 %in% c('USA'), 'USA', 'not_USA'))
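`lm()` converts a character variable to a factor whose levels are sorted, and the sort order can depend on the locale. Setting the levels explicitly makes the reference category unambiguous. A sketch on a made-up sample vector, assuming `not_USA` should be the reference so that the coefficient describes USA films:

```r
# Made-up sample of cleaned country labels
country_sample <- c("USA", "UK", "Germany", "USA")

# The first level is the reference category in the regression
country_binary_sample <- factor(
  ifelse(country_sample == "USA", "USA", "not_USA"),
  levels = c("not_USA", "USA")
)

print(levels(country_binary_sample))
print(table(country_binary_sample))
```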
library(sjPlot)
labs = c("Constant", "Year of release",
"Meta score",
"Country (USA)")
model <- lm(time ~ year + metascore + country_binary, data = movies_df_2)
tab_model(model, pred.labels = labs, title = "Table 3. Linear regression: Factors that influence runtime of the best movies of all time",
dv.labels = "Runtime")
| | Runtime | | |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| Constant | -812.05 | -1347.24 – -276.86 | 0.003 |
| Year of release | 0.40 | 0.16 – 0.64 | 0.001 |
| Meta score | 1.51 | -0.13 – 3.14 | 0.070 |
| Country (USA) | -0.07 | -13.64 – 13.49 | 0.991 |
| Observations | 98 | | |
| R² / R² adjusted | 0.115 / 0.087 | | |
Each additional year in the year of release is associated with a 0.4-minute increase in runtime on average, holding everything else constant (p-value = 0.001).
The other variables are not statistically significant at the 5% level: metascore is only marginally significant (p-value = 0.070) and the country indicator is far from significant (p-value = 0.991). The model uses 98 observations because two movies have no runtime.
R-squared equals 0.115 (adjusted R-squared: 0.087), which means that the model explains only about 12% of the variance in runtime. Thus, I conclude that its explanatory power is weak.
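To make the estimates more concrete, the sketch below plugs the rounded coefficients from the regression table into the model equation for a 1972 US film with a metascore of 100 (the values of The Godfather, whose observed runtime is 175 minutes), and recomputes the adjusted R-squared from the reported R-squared. Because the year coefficient is rounded to two decimals and multiplied by a value near 2000, the fitted runtime can be off by several minutes; treat it as an illustration of the arithmetic, not as output of the fitted model.

```r
# Rounded coefficients copied from the regression table
coefs <- c(constant = -812.05, year = 0.40, metascore = 1.51, usa = -0.07)

# Fitted runtime for a 1972 US film with a metascore of 100
predicted <- coefs[["constant"]] +
  coefs[["year"]] * 1972 +
  coefs[["metascore"]] * 100 +
  coefs[["usa"]] * 1
print(round(predicted, 2))

# Adjusted R-squared from R-squared, n observations and p predictors
r2 <- 0.115
n <- 98
p <- 3
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))
```

The fitted value of roughly 128 minutes is well below the observed 175 minutes, which is consistent with the low R-squared, and the recomputed adjusted R-squared rounds to the 0.087 shown in the table.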